ABSTRACT

This study examined how four commonly used test
equating procedures (linear, equipercentile, Rasch model, and
three-parameter) would respond to situations in which the properties
of the two tests being equated were different. Data for two tests
plus an external anchor test were generated from a three-parameter
model in which mean test differences in difficulty, discrimination,
and lower asymptote were manipulated. In each case two data sets were
generated, each consisting of the responses of 2,000 examinees to a
35-item test plus a 15-item anchor test. Each test equating case thus
comprised 4,000 examinees and 85 items. The robustness with respect to
violations of assumptions was tested for the linear and Rasch
equating methods. For equipercentile equating, the results showed how
the method responded to various conditions; for the three-parameter
model, the study primarily tested LOGIST's simultaneous estimation
procedure. Results indicated that equipercentile equating was very
stable across the cases studied. Linear and Rasch model equating were
very sensitive to violations of their models' assumptions. Results
for the three-parameter model were disappointing. Paying close
attention to test item properties was advised when selecting an
equating method. (BS)

AN EXPLORATION OF THE ROBUSTNESS OF FOUR TEST EQUATING MODELS

Gary Skaggs

Robert W. Lissitz
University of Maryland

The application of item response theory (IRT) to many measurement problems has
been one of the major psychometric breakthroughs of the past twenty years. IRT
methodology is currently being used in many large standardized testing programs, and
at the same time, a great deal of research is being done to evaluate the robustness
of the procedures under a variety of conditions. One of the most important
applications of this model is in the area of test equating.

The purpose of test equating is to determine the relationship between raw 
scores on two tests that measure the same ability. Equating can be horizontal, 
between tests of equivalent difficulty and content, or vertical, between tests of
intentionally different difficulties. 

Most conventional approaches to test equating have been described as either
linear or equipercentile methods (see Angoff, 1971). In contrast, methods using item
response theory (IRT) (Lord, 1980; Wright & Stone, 1979) are now widely used.
In these techniques, a mathematical relationship between raw scores on tests is
modeled. This relationship is based on estimates of item parameters from two tests,
and placement of these estimates on the same scale.

A number of equating studies using IRT methods have appeared in recent years.
A majority of these have dealt with the one-parameter logistic, or Rasch, model.
This model is the simplest of the IRT models but also the most demanding one in 
terms of its assumptions. 

Several studies have found the Rasch model to be useful and appropriate for
item calibration and linking (Tinsley & Dawis, 1975; Rentz & Bashaw, 1977; Guskey,
1981; Forsyth, Saisangjan, & Gilmer, 1981). On the other hand, a number of
researchers have noticed problems with the Rasch model for vertical equating
(Whitely & Dawis, 1974; Slinde & Linn, 1978, 1979; Loyd & Hoover, 1980; Holmes,
1982).

Some of the inconsistency can be attributed to different equating designs,
different types of tests being equated, and different procedures used to analyze the 
results. Regardless of the cause, there are some fundamental concerns about the 
Rasch model. The most frequently postulated arguments concern the failure of the 
model to account for chance scoring, unequal discrimination, and
multidimensionality. The last concern applies to more complex IRT models as well. 

Research using the three-parameter logistic and other models has not been as
plentiful as with the Rasch model, but it suggests the same interpretative
difficulties. Most of the studies have examined the three-parameter model in the
context of horizontal equating of general ability tests. This work has generally
supported the use of the three-parameter model (Marco, Petersen, & Stewart, 1979;
Kolen, 1981; Petersen, Cook, & Stocking, 1981). For vertical equating, results have
been more mixed, with some studies finding the three-parameter model to be more
effective than the Rasch model (Marco et al., 1979; Kolen, 1981). However, the
comparison between the models has been shown to depend largely on the content of the
tests being equated (Kolen, 1981; Petersen et al., 1981; Holmes & Doody-Bogan,
1983).

With all of this conflicting research, it is very difficult to make decisions
about how to use IRT equating or whether to use it at all. The purpose of the
present study is to explore how test equating results can be affected by the
parameters of the items that make up the tests being equated.

Four methods of equating were chosen, representing popular versions of linear,
equipercentile, Rasch model, and three-parameter model techniques. Data for two
tests plus an external anchor test were generated from a three-parameter model in
which mean test differences in difficulty, discrimination, and lower asymptote were
manipulated. For Rasch model and linear equating, this study is an exploration of
robustness when the model's assumptions are violated. For the three-parameter
model, this study amounts to an examination of the parameter estimation strategy.
For the equipercentile equating, this study explores its effectiveness under a
variety of test conditions.

METHOD 

Data

Data in this study were generated from the three-parameter logistic model: 

$$P(u_{ij} = 1 \mid \theta_j, a_i, b_i, c_i) = c_i + \frac{1 - c_i}{1 + \exp[-1.702\, a_i (\theta_j - b_i)]} \qquad (1)$$

The response to item i by person j, a 0 or 1, was determined by comparing the
probability defined by equation 1 to a random number drawn from a (0,1) uniform
distribution. If the probability of a correct response exceeded the random number,
the item was scored as correct. Otherwise, the item was scored as incorrect. The
random numbers were produced from the GGUBS (IMSL, 1980) generator.
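
The generation rule just described can be sketched in a few lines. This is a minimal illustration, not the original program: Python's `random.Random` stands in for the IMSL GGUBS generator, and the function names, seed, and example item values are hypothetical.

```python
import math
import random

D = 1.702  # scaling constant from equation 1


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_responses(thetas, items, rng):
    """Return a persons-by-items list of 0/1 responses.

    items is a list of (a, b, c) tuples; rng is a random.Random instance
    standing in for the uniform generator used in the study.
    """
    data = []
    for theta in thetas:
        row = []
        for a, b, c in items:
            # Scored correct when the model probability exceeds the uniform draw.
            row.append(1 if p_correct(theta, a, b, c) > rng.random() else 0)
        data.append(row)
    return data


rng = random.Random(12345)  # arbitrary seed, for reproducibility only
items = [(0.8, -1.0, 0.2), (0.8, 0.0, 0.2), (0.8, 1.0, 0.2)]  # illustrative values
thetas = [rng.gauss(0.0, 1.0) for _ in range(5)]
responses = simulate_responses(thetas, items, rng)
```

When ability equals item difficulty, the probability reduces to c + (1 - c)/2, which gives a quick check on the function.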

In all simulation cases, an external anchor test design was used. In each
case, two data sets were generated. Each data set consisted of the responses of
2,000 examinees to a 35-item test plus an anchor test of 15 items. Each test
equating case thus comprised 4,000 examinees and 85 items. This size was chosen
to be large enough to provide stable parameter estimates for both IRT models
(Lissak, Hulin, & Drasgow, 1982).

Item and Ability Parameters

The item parameters used to generate the data were determined by manipulating
lower asymptotes and mean test difficulty and discrimination of the tests being
equated. For ease of reference, the two tests will be referred to as Test A and
Test B.

Item difficulty was studied at three levels: 1) $\bar{b}_A = \bar{b}_B = 0$;
2) $\bar{b}_A = -.5$, $\bar{b}_B = .5$; and 3) $\bar{b}_A = -1.0$, $\bar{b}_B = 1.0$.
For each test, difficulties were uniformly distributed across a range of +/- 2 logits.

Item discrimination was also examined at three levels: 1) $\bar{a}_A = \bar{a}_B = .8$;
2) $\bar{a}_A = .5$, $\bar{a}_B = 1.1$; and 3) $\bar{a}_A = 1.1$, $\bar{a}_B = .5$. In each
test, discriminations were uniformly distributed across a range of +/- .1.
Difficulties and discriminations were randomly paired.

Lower asymptote values were manipulated in four ways: 1) $c_A = c_B = 0$;
2) $c_A = c_B = .2$; 3) $c_A = 0$, $c_B = .2$; and 4) $c_A = .2$, $c_B = 0$. In each
case, lower asymptote values were the same for all items within a test.

In this study, a complete crossing of all levels produced 36 cells, or cases,
of pairs of tests to be equated. All anchor test items had a mean difficulty of
zero and a mean discrimination of .8. For vertical equating, the anchor items
represented an overlap in difficulty between Tests A and B. Lower asymptote values
were all zero except in the case where the values were .2 for both tests. In these
cases, lower asymptotes were .2 for the anchor test items.
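
The crossing of levels can be enumerated directly, which also confirms the count of 36 cases (3 difficulty levels x 3 discrimination levels x 4 lower asymptote patterns). The sketch below uses the level values as read from the design description above; the dictionary layout and the name `build_cells` are illustrative assumptions, not part of the original study.

```python
from itertools import product

# (Test A value, Test B value) for each manipulated property
difficulty_levels = [(0.0, 0.0), (-0.5, 0.5), (-1.0, 1.0)]
discrimination_levels = [(0.8, 0.8), (0.5, 1.1), (1.1, 0.5)]
asymptote_levels = [(0.0, 0.0), (0.2, 0.2), (0.0, 0.2), (0.2, 0.0)]


def build_cells():
    """Cross all levels and attach the anchor test specification."""
    cells = []
    for (b_a, b_b), (a_a, a_b), (c_a, c_b) in product(
        difficulty_levels, discrimination_levels, asymptote_levels
    ):
        # Anchor items have c = .2 only when both tests have c = .2.
        anchor_c = 0.2 if (c_a == 0.2 and c_b == 0.2) else 0.0
        cells.append({
            "test_a": {"mean_b": b_a, "mean_a": a_a, "c": c_a},
            "test_b": {"mean_b": b_b, "mean_a": a_b, "c": c_b},
            "anchor": {"mean_b": 0.0, "mean_a": 0.8, "c": anchor_c},
            "vertical": b_a != b_b,  # unequal mean difficulty
        })
    return cells


cells = build_cells()
```

Of the 36 cells, the 24 with unequal mean difficulties are the vertical equating cases.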

These item parameters were chosen to reflect a typical test equating between
tests of either equal or unequal difficulty (to simulate horizontal or vertical
equating). The abilities of the examinees were chosen to match each test's
difficulty, an ideal situation and one found commonly in achievement testing. Each
sample of 2,000 examinees was selected from a normal distribution with a mean equal
to the mean difficulty of the test and a standard deviation of one. The GGNML
(IMSL, 1980) generator was used to generate ability parameters for each sample.
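
As a rough illustration of how one test and its examinee sample could be set up under this design, consider the sketch below. It rests on two labeled assumptions: "uniformly distributed" is read here as equally spaced values across the stated range, and Python's `random.Random` replaces the IMSL generators. The function names are hypothetical.

```python
import random


def make_test(mean_b, mean_a, c, n_items, rng):
    """Build (a, b, c) triples for one test.

    Assumption: difficulties are equally spaced over mean_b +/- 2 logits and
    discriminations over mean_a +/- .1, then paired at random.
    """
    step_b = 4.0 / (n_items - 1)
    bs = [mean_b - 2.0 + i * step_b for i in range(n_items)]
    step_a = 0.2 / (n_items - 1)
    discriminations = [mean_a - 0.1 + i * step_a for i in range(n_items)]
    rng.shuffle(discriminations)  # random pairing with difficulties
    return [(a, b, c) for a, b in zip(discriminations, bs)]


def make_abilities(mean_b, n_examinees, rng):
    """Normal abilities centered on the test's mean difficulty, SD of one."""
    return [rng.gauss(mean_b, 1.0) for _ in range(n_examinees)]


rng = random.Random(2024)  # arbitrary seed
test_a = make_test(-0.5, 1.1, 0.0, 35, rng)
abilities_a = make_abilities(-0.5, 2000, rng)
```

With a mean difficulty of -.5 this gives difficulties from -2.5 to 1.5, and with a mean discrimination of 1.1 it gives discriminations from 1.0 to 1.2.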

Equating Methods 

One linear, one equipercentile, and two item response theory (IRT) equating
methods were chosen for this study on the basis of their popularity. In all cases,
an external anchor test design was used, and Test B was equated to Test A. That is,
for each raw score on Test B, an equivalent was found on the raw score scale of Test
A. For vertical equating, Test B was always the more difficult test.

The linear equating method has been described by Angoff (1971) as Design IVC-1
and is a procedure derived by Levine (1955) for equally reliable tests.
Equipercentile equating was accomplished using Levine's (1958) method, which has been
described by Angoff as Design IVB. Cureton & Tukey's (1951) rolling weighted
average method was used to smooth the cumulative distributions.

One of the IRT equating methods is based on the Rasch model. Item parameters 
were estimated using BICAL (Wright, Mead, & Bell, 1980), and the equating was done 
using procedures outlined by Wright & Stone (1979).

For three-parameter model equating, parameter estimates were obtained from
LOGIST V (Wingersky, Barton, & Lord, 1982). Many versions of this program exist.
For this study, a version adapted by ETS for a UNIVAC 1100 was used. For each
equating case, item parameters for both tests and the anchor test were estimated
simultaneously by employing LOGIST's option for not-reached items. The equating
then followed Lord's (1980) estimated true score equating procedure. For below-chance
raw scores, Lord's (1980, p. 210) method of linear extrapolation was used.

Analysis Procedures

Since the data were generated from a known three-parameter model, these initial 
item parameters were used to develop a criterion for the test equating cases. This 
criterion was simply a pairing of raw scores corresponding to the same ability 
estimates: 

$$\eta_A = \sum_{i} P_i(\theta), \qquad \eta_B = \sum_{j} P_j(\theta)$$
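
One way to compute such a criterion is to invert the Test B characteristic curve at each integer raw score and evaluate the Test A curve at the resulting ability. The sketch below does this by bisection. It is an illustration under stated assumptions, not the authors' program: the search interval of -10 to 10, the tolerance, and all function names are hypothetical, and scores outside the attainable true score range (for example, below the sum of the lower asymptotes) are simply skipped.

```python
import math

D = 1.702  # scaling constant from equation 1


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def true_score(theta, items):
    """Expected raw score at ability theta: the sum of item probabilities."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items)


def theta_for_true_score(score, items, lo=-10.0, hi=10.0, tol=1e-8):
    """Invert the test characteristic curve by bisection; None if unattainable."""
    if not (true_score(lo, items) < score < true_score(hi, items)):
        return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if true_score(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def criterion_equating(items_a, items_b):
    """Map each attainable raw score on Test B to its Test A equivalent."""
    table = {}
    for score in range(len(items_b) + 1):
        theta = theta_for_true_score(score, items_b)
        if theta is not None:
            table[score] = true_score(theta, items_a)
    return table
```

Bisection works here because the true score is strictly increasing in ability whenever all discriminations are positive. Equating a test to itself should return each score unchanged, which is a convenient sanity check.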

This equating function was then compared to the equating functions produced by
the four equating methods. Besides plotting these results, two summary statistics
were used to interpret the results. These statistics are very similar to mean
square error statistics used in other equating studies (e.g., Marco, Petersen, &
Stewart, 1979; Petersen, Cook, & Stocking, 1981). These indices are referred to
here as the weighted and unweighted mean square error (MSE) and can be stated as
follows:

$$\text{weighted MSE} = \frac{\sum_{i} f_i \,(X_e - X_{crit})^2}{s_b^2 \sum_{i} f_i}$$

$$\text{unweighted MSE} = \frac{\sum_{i=1}^{k} (X_e - X_{crit})^2}{k\, s_b^2}$$

where k equals the number of items on Test B, $s_b^2$ equals the raw score variance for
Test B, $X_{crit}$ is the criterion test score equivalent on Test A for raw score i on
Test B, $X_e$ is the Test A equivalent for raw score i that is produced by one of the
equating methods, and $f_i$ is the frequency of raw score i on Test B. The summation
is over raw score values, except that for the weighted MSE, the summation is only
across that part of the scale where extrapolation was not necessary. Zero and
perfect scores were excluded from all IRT equatings, but included in both
conventional equatings.
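
The two indices can be sketched as follows. The printed formulas are damaged in the source, so this is one reading rather than the authors' exact definition: the weighted index is taken to be the frequency-weighted squared error divided by the total frequency and the Test B raw score variance, and the unweighted index is assumed to divide by the number of score points supplied. The function and argument names are hypothetical.

```python
def weighted_mse(x_e, x_crit, freqs, var_b):
    """Frequency-weighted squared equating error, scaled by Test B variance."""
    num = sum(f * (e - c) ** 2 for f, e, c in zip(freqs, x_e, x_crit))
    return num / (sum(freqs) * var_b)


def unweighted_mse(x_e, x_crit, var_b):
    """Squared equating error with every score point counted equally.

    Assumption: the divisor is the number of score points supplied.
    """
    n_points = len(x_e)
    num = sum((e - c) ** 2 for e, c in zip(x_e, x_crit))
    return num / (n_points * var_b)


# Illustrative values only: the error sits at a score point nobody obtained.
x_crit = [1.0, 2.0, 3.0]
x_e = [1.0, 2.0, 4.0]
freqs = [10, 10, 0]
w = weighted_mse(x_e, x_crit, freqs, var_b=4.0)
u = unweighted_mse(x_e, x_crit, var_b=4.0)
```

In this toy case the weighted index is zero while the unweighted index is not, because the only error occurs at a score with zero frequency. The same mechanism lets errors at sparsely populated score ranges go unnoticed in a weighted index.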

RESULTS

Raw score means and standard deviations for all data sets are shown in Table 1.
Raw score means ranged from approximately 17.5 to 21.3 and standard deviations from
5.0 to 7.7. By looking at data sets generated under similar item parameters, it is
clear that the generation procedure produced very consistent results. An
examination of the frequency distributions for each data set also revealed a high
degree of consistency in the shapes of the distributions. This in turn suggests
that there was a high degree of stability in the equatings.

As expected, a higher degree of item discrimination in the generating item 
parameters produced more dispersion in the raw score distributions. The reverse was 
true for low discrimination. Non-zero lower asymptotes produced negatively skewed 
raw score distributions. 

The summaries of the two mean square error indices are presented in Tables 2
and 3. The first case, where test difficulties and discriminations were equal and
lower asymptotes were zero, was a situation where the data fit the Rasch model.
From a psychometric point of view, too, this represented an ideal (easy) equating
situation. The best result for all four methods, almost perfect equating, was
found in this situation. The worst results for all methods occurred where mean test
difficulties and discriminations were unequal, where levels of chance scoring were
unequal, and where low discrimination was paired with non-zero lower asymptotes on
the more difficult test.

In general, the error indices for the equipercentile method were the lowest
across all cases. This was followed by the three-parameter model. Values for the 
Rasch-model and linear equating tended, naturally, to be relatively large in 
situations where their assumptions were violated. 

To aid in the interpretation of Tables 2 and 3, repeated measures
analyses of variance were performed on the two MSE indices. The results for the
weighted MSE appear in Table 4 and for the unweighted MSE in Table 5. All effects
involving a comparison of the four methods were significant. Figures 1 to 4 show
cell plots of all means for the first and second order interactions between the four
equating methods and independent variables. In each plot, the values shown are
means pooled across the variable(s) not included in the plot.

Finally, the actual equating functions for each case are shown in Figure 5. In
each plot, the solid line represents the criterion equating based on the initial
item parameters. The four broken lines represent the results from the four equating
procedures. The criterion equating in most cases was curvilinear, making linear
equating clearly inappropriate. In most of the plots, Rasch and linear equating
were the most deviant, while the equipercentile equating line was closest to the
criterion, thus visually confirming the MSE values in Tables 2 and 3.

DISCUSSION 

From a statistical viewpoint, the robustness with respect to violations of
assumptions was tested in this study only for the linear and Rasch equating methods.
For equipercentile equating, the results showed how the method responded to a
variety of conditions. In the case of the three-parameter model, this study was
primarily a test of LOGIST's simultaneous estimation procedure.

Linear Equating 

The assumptions of the linear equating model are violated whenever the shapes 
of the raw score distributions differ for the two tests being equated. This 
occurred when the mean test discrimination and/or level of lower asymptotes differed 
between the two tests. The appropriateness of linear equating could be gauged by 
the degree of curvilinearity in the criterion equating function. The total error 
for linear equating was the smallest for horizontal equating with equally 
discriminating tests. Chance scoring did not affect the equating in these cases 
since the criterion equating function was still linear. Linear equating was clearly 
inappropriate for all vertical equating cases and for horizontal equating where mean 
test discriminations were unequal. 

Equipercentile Equating

As can be seen in Tables 2 and 3, equating error for equipercentile equating
was generally the lowest of all four methods. All MSE values were below .25, and it
provided the smallest values in 30 of the 36 cases and in all the vertical equating
cases. In the most extreme situations, this method was the only one of the four to
produce what we would consider acceptable results. Perhaps one reason for this is
that it is the only one of the four approaches not based on a model. It is simply the
best fit of the data at hand. The issue of cross-validation might be important in
some situations, but in this case, our preliminary work (not reported here) shows
that the results are very stable.

In this version of equipercentile equating, a total group cumulative
distribution was estimated for both tests based on the responses of the combined
samples to the anchor test items. That this estimation in conjunction with a
smoothing routine worked so well was somewhat surprising.

Rasch Model Equating 

An examination of the results in Tables 2 and 3 suggests that the Rasch model
was not very robust to violations of its assumptions. The first case in those
tables and in Figure 5 shows a situation where the data fit the Rasch model for
horizontal equating. The Rasch model, as expected, performed extremely well, as did
the other three methods. In the second case in the tables and in Figure 5, all
items had a lower asymptote of .2. Yet, the equating was still quite good for all
methods.

In subsequent cases, where the level of chance scoring was unequal in the two
tests and where test discriminations were unequal, the Rasch model performed very
poorly. In situations where low discrimination was paired with non-zero lower
asymptotes, the total error was relatively large. An explanation for these results
can be found by looking at the estimation and linking procedure. When the BICAL
program is faced with a data set, a metric is chosen so that mean difficulty is equal
to zero and all discriminations are equal to one. When BICAL runs are done for two
tests with different properties, the resulting metrics are different, and estimated
item difficulties for one test are more or less compressed than they should be. The
use of an equating constant does not alter the underlying metric, and a bias is
introduced into the equating.

That this bias can be severe can be seen in the next to last plot in Figure 5
in the case where low discrimination and chance scoring are both present in the more
difficult test. The BICAL estimates for this case revealed a range of difficulty of
-4.0 to 3.1 for Test A, but for Test B, the range was -1.5 to 1.2. Both tests were
generated with a range of +/- 2 logits. Obviously the metrics are quite different.

For vertical equating, where the data for each test fit the Rasch model, Rasch
equating produced adequate results where the test difficulties differed by one
logit. However, where tests differed by two logits in difficulty, the equating was
not as good. One possible reason for this involves the anchoring procedure. Since
the anchor items represented an overlap in the difficulty ranges of both
tests, these items were very difficult for those taking Test A and very easy for
those taking Test B. Consequently, estimation was not as accurate. An anchor test
with a wider range of difficulty (see Loyd, 1983) might have alleviated this
problem.

In the vertical equating cases, the error introduced by unequal mean
discrimination and chance scoring was even more pronounced. Even where the same
degree of chance scoring occurred on the two tests, Rasch equating was clearly
inadequate. These results therefore corroborate, from a different methodological
perspective, empirical results that advise against using the Rasch model to equate
vertically whenever chance scoring is a possibility. These results also advise
against using the Rasch model in any situation where mean test discriminations are
unequal.

These problems would be especially difficult to overcome when one is
constructing alternate forms from an item bank. To ensure that all tests formed
from the bank had the same mean discrimination, all items in the bank would have to
have the same discrimination, or a complicated algorithm for item selection would
have to be programmed. Similarly, chance scoring would not be a problem only if all
items had the same degree of chance scoring and the forms to be equated were of
comparable difficulty. This is a difficult task for any test developer.

Three-Parameter Model Equating

Since the data were generated from a three-parameter model, one would expect
three-parameter model estimation and equating to be quite accurate. An examination 
of the values in Tables 2 and 3 indicates that this was not always the case. 

The plots in Figures 1 to 4 suggest that three-parameter equating was
relatively unaffected by levels of test difficulty or chance scoring. However, the
equating was affected by unequal discrimination. There was also an interaction
between unequal discrimination and test difficulty and chance scoring. The greatest
errors occurred where the more difficult test also had the lower discrimination and
where a higher degree of chance scoring was paired with lower discrimination.

Since the data actually fit the model, the LOGIST estimation procedure as
programmed should be held responsible for the success of the equating. In this
study, simultaneous estimation was used. A single LOGIST run was used for each test
equating by employing the "not reached" option. In every case for this study, the
LOGIST estimation converged. However, for unequal discrimination and chance
scoring, the program typically took at least 35 stages to converge. (For practical
reasons, it was decided to extend the limit on the number of stages rather than
produce continuation runs.)

That LOGIST was unable to recover the initial metric can be illustrated by
looking at the parameter estimates from one of the cases. For the situation where
the initial parameters for Tests A and B were $\bar{b}_A = -.5$, $\bar{a}_A = 1.1$,
$c_A = .0$ and $\bar{b}_B = .5$, $\bar{a}_B = .5$, $c_B = .2$, the weighted and
unweighted MSEs were .616 and .563, respectively. For Test A, the LOGIST difficulty
estimates ranged from -3.14 to 1.72, while the original difficulties ranged from
-2.5 to 1.5. However, after linearly transforming the LOGIST estimates to the
original metric, the estimates ranged from -4.03 to 2.52. The LOGIST
discriminations for Test A ranged from .6 to 1.3 compared to the original 1.0 to
1.2. After transformation, the range becomes .6 to .9.

For Test B, the LOGIST difficulties after transformation ranged from -1.37 to 
2.55. The original span was from -1.5 to 2.5. However, the difficulties were 
poorly estimated for the easiest half of the test. The LOGIST discriminations after 
transformation ranged from .5 to 1.0, the original range being .4 to .6. 

Ironically, the lower asymptotes were estimated reasonably well. The default 
options were used, and default values were obtained for the easiest items on both 
tests. Yet, no item had a c value greater than .1 on Test A, and only six items on
Test B had c values less than .1. 

Another peculiarity was observed in the LOGIST results across all cases. On 
each test, parameters for a few items (one to three out of 35) were estimated 
extremely poorly. These tended to occur more frequently on tests with weaker 
discriminations. No apparent reason for these outliers could be found as all item 
responses were generated from the same function. Still, an erroneous decision on 
the quality of an item could be made from these results. 

Clearly, in cases of unequal discrimination, LOGIST was unable to reproduce the
original metric, and equating was therefore biased. The differences in
discrimination were quite severe in this study, and it is not known how well LOGIST
would respond to milder differences. On the other hand, the parameters for this
study (sample size, test length, and ability distribution) were chosen to yield
stable, reproducible estimates. The results suggest therefore that the use of the
simultaneous estimation procedure of LOGIST is questionable in circumstances such as
these. Some other method for transforming estimates to the same scale should be
considered.

Comments on Analysis Procedures 

A review of published equating studies reveals a wide variety of evaluation
procedures and summary statistics. The degree to which methodology affected
conclusions is not known in these studies. In this study, the weighted mean square
error statistic was chosen because it has appeared frequently in the literature
(e.g., Marco et al., 1979; Petersen et al., 1981). When the results from these
statistics were compared to graphs of the equating functions (Figure 5), the
weighted MSE values did not seem to represent some of the cases accurately. This
was because there were relatively few persons in the raw score ranges where the
greatest equating errors occurred, at the lower end of the distribution.

Because of this, the unweighted MSE was also computed. Because each raw score
counted equally with this statistic, the values tended to be higher than for the
weighted statistics. In Figures 1 to 4, the weighted MSE values appear on the left
hand side and the unweighted values on the right. A comparison of the two sets of
plots suggests that the two sets of MSE values turned out to be very similar for
equipercentile and three-parameter model methods. For linear and Rasch model
equating, quite different results appeared in some of the plots.

Other abnormalities appear when looking at the plots in Figures 1 to 4. For
example, in Figure 1, the symmetry in the study's design does not appear in the MSE
values for levels of the discrimination and lower asymptote independent variables.
Tests A and B alternate between mean discriminations of .5 and 1.1 and lower
asymptotes of 0 and .2. Yet, the higher MSE values occur when Test A has the higher
discrimination. The same thing occurs when there is more chance scoring on Test B
than on Test A. If one examines symmetrical designs (third, fourth, fifth, and
ninth plots) in Figure 5, the plots appear to be mirror images of one another.

The paradox can be explained by the fact that the MSE statistic uses vertical
distances from the plots. If horizontal distances were used (i.e., Test A equated to
Test B), the pattern of results would be reversed. This analysis calls into
question the use of MSE statistics for this purpose. A great deal of theoretical
statistical work is needed in the area of proper error indexing.

CONCLUSIONS 

The purpose of this study was to examine how four commonly used test equating 
procedures would respond to situations in which the properties of the two tests 
being equated were different. The results indicated that equipercentile equating 
was very stable across the cases studied. Linear and Rasch model equating were very 
sensitive to violations of their models' assumptions. Rasch model equating showed
robustness only for horizontal equating where the degree of chance scoring was the
same for both tests.

When data fit the Rasch model, three-parameter model equating and Rasch 
equating achieved comparable and accurate results. In all other cases, 
three-parameter equating was far better than the Rasch model but generally not as 
good as equipercentile equating. The results for three-parameter model equating 
were disappointing since the data were generated from a three-parameter model. 



Simultaneous estimation using LOGIST seemed unable to recover the original metric, 
especially when mean test discriminations were unequal. 

All of the equating methods were affected by some situations. Where the tests
being equated differed in difficulty, mean discrimination, and in their degree of
chance scoring, the equating error was the largest for all four methods. This
suggests that equating tests should not be attempted under such extreme conditions.
None of the equating methods could completely overcome the effect of such divergence
in item type.

The use of the MSE statistics produced several paradoxes in the results. 
These could be resolved by examining the equating functions themselves. Certainly, 
more statistical work needs to be done in the comparison of test characteristic 
curves. 

Finally, all of the data for this study were generated from a unidimensional
three-parameter model. Real data do not exactly conform to this model, although it
seems reasonable in a wide variety of situations. How these methods would respond 
to multidimensional data is not known, but problems for both IRT methods were 
uncovered in the unidimensional case. 

This study supported other research findings that the Rasch model is
inappropriate for use in vertical equating situations. The three-parameter model
procedure used here also did not generally produce acceptable results in more
complex situations where we might have expected it to do so. The best advice at
this point would seem to be to pay very close attention to the properties of the
test items. If the tests differ very much in their properties, then classic
equipercentile equating is suggested.

Table 4

Analysis of Variance of Unweighted Mean Square Error

Table 5

Analysis of Variance of Weighted Mean Square Error

Source   df   MS   p

[Figure 5. Equating function plots. Horizontal axis: Test B raw score (0 to 40). Panel titles recoverable from the scan, where b and a are mean difficulty and mean discrimination, c is the lower asymptote, and theta is mean ability:]

Test A: b = -.5; a = 1.1; c = .2; theta = -.5 (Test B line not recoverable)

Test A: b = -1.0; a = .5; c = .2; theta = -1.0
Test B: b = 1.0; a = 1.1; c = .2; theta = 1.0

Test A: b = -1.0; a = .5; c = .2; theta = -1.0
Test B: b = 1.0; a = 1.1; c = .0; theta = 1.0

Test A: b = -1.0; a = 1.1; c = .0; theta = -1.0
Test B: b = 1.0; a = .5; c = .2; theta = 1.0

Test A: b = -1.0; a = 1.1; c = .2; theta = -1.0
Test B: b = 1.0; a = .5; c = .0; theta = 1.0